Frontiers in Bioinformatics
Frontiers Media SA
Preprints posted in the last 90 days, ranked by how well they match the content profile of Frontiers in Bioinformatics, based on 45 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Jabin, A.; Ahmad, S.
Molecular profiling of tumours via RNA sequencing (RNA-seq) enables clinically actionable stratification but remains costly, tissue-intensive, and time-consuming. Recent advances in computational pathology suggest that routine H&E whole-slide images (WSIs) can be used to estimate the transcriptomic states of cancer cells. Because WSI-derived predictions of transcriptional signatures are noisy, however, their accurate biological interpretation is challenging. Pathway enrichment analysis, on the other hand, is routinely used to derive biologically meaningful cellular states from noisy gene expression data, and some studies have evaluated how well WSI-predicted gene expression profiles reconstruct enriched pathways in experiments where the two data modalities were concurrently available. It remains unclear, though, whether a model designed to predict enriched pathways directly from WSIs would outperform the current approach of first predicting gene expression. Here, we develop and evaluate these two complementary approaches for predicting pathway enrichment profiles from WSIs in TCGA Breast Invasive Carcinoma (TCGA-BRCA), training parallel models that predict pathway enrichment directly from image features alongside models that rely on predicted gene expression profiles, the current state of the art. Our results suggest that, under controlled experiments, direct prediction of a selected pool of enriched pathways outperforms models trained to predict gene expression followed by enrichment inference on the predicted values. These findings will help prioritize the goals of predictive modeling of WSIs and improve diagnostic outcomes for cancer patients.
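As an illustrative aside to the enrichment-versus-expression comparison above: a minimal, toy version of rank-based single-sample pathway scoring (the general family that enrichment-on-predicted-expression pipelines draw on) can be sketched as follows. The gene names, gene sets, and scoring rule are hypothetical, not the authors' method.

```python
def pathway_score(expression, pathway_genes):
    """Mean rank of a pathway's genes, rescaled to [-1, 1].

    Positive values mean the set sits among highly expressed genes;
    a crude, noise-tolerant stand-in for full enrichment statistics.
    """
    ranked = sorted(expression, key=expression.get)      # low -> high
    rank = {gene: i for i, gene in enumerate(ranked)}
    in_set = [rank[g] for g in pathway_genes if g in rank]
    if not in_set:
        return 0.0
    mid = (len(ranked) - 1) / 2
    return (sum(in_set) / len(in_set) - mid) / mid

# Toy predicted-expression profile: GENE0 lowest, GENE9 highest.
expr = {"GENE%d" % i: float(i) for i in range(10)}
score_up = pathway_score(expr, {"GENE7", "GENE8", "GENE9"})
score_down = pathway_score(expr, {"GENE0", "GENE1", "GENE2"})
```

Because the score depends only on ranks, it is somewhat robust to the kind of noise that WSI-predicted expression profiles carry, which is part of the motivation for comparing enrichment-level and gene-level prediction targets.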
Reddy, T.; Schneider, A.; Hall, A. R.; Witmer, A.; Hengartner, N.
There have been several attempts to develop machine learning (ML) models that identify human-infecting viruses from their genomic sequences, with varying degrees of success. Direct comparison between models is problematic because these models are typically trained and evaluated on different datasets with alternative data splitting schemes, features, and model performance metrics. In this paper we present a standardized dataset of mammal-infecting and non-infecting viral pathogens, refined from the previous work of Mollentze et al. to include the latest literature evidence, roughly doubling the number of curated host-virus records available to the community, and adding new host target labels, primate and mammal. The new host labels were included for several reasons: previous reports that classification performance is better at broader taxonomic ranks; the possibility that the larger amount of data for primate infection may serve as a suitable proxy for zoonotic potential; and the avoidance of false positives for human infection due to absence of evidence. On this dataset, we report the performance of eight machine learning models for predicting mammal-infecting viruses from their genomic sequences. We find that randomly assigning cases in our improved dataset to training/testing sets, compared to the original training/testing assignments in Mollentze et al., increases the overall average ROC AUC for prediction of human infection from 0.663 ± 0.070 to 0.784 ± 0.013, consistent with the reduction in phylogenetic distance between train and test sets (relative entropy change from 3.00 to 0.08). The broadest host category, mammal infection, can be predicted most reliably, at 0.850 ± 0.020. We share our improved dataset and code to enable standardized comparisons of machine learning methods for predicting human host infections.
Overall, we have presented preliminary evidence that classification of virus host infection is more tractable at higher taxonomic ranks, that, unsurprisingly, reducing the phylogenetic distance between training and test sets can improve predictive performance, and that peptide kmer features appear to be harmful to out-of-sample model performance; we are left with the question of whether models for virus host prediction can reasonably be expected to perform well in out-of-sample scenarios given the likelihood that viruses do not share a common ancestor. Consistent with this concern, when the data is resampled such that there is no overlap between viral families in training and test sets (relative entropy > 24), models perform no better than random chance at predicting human infection, regardless of whether kmers are included (ROC AUC 0.50 ± 0.08) or not (ROC AUC 0.50 ± 0.04). Author Summary: Determining whether a virus can infect a human or other animal based on its genetic information is useful for assessing the threat level of circulating and newly emerging viruses. Previous studies in this domain have had access to limited datasets; in this work we nearly double the amount of manually labelled host data for viral infection, so that others may build on it and improve it further. We use machine learning models to rank the likelihood of human and mammal infection for viruses in this improved dataset. Results are consistent with the determination of host infection being more tractable for broader categories of hosts, like mammals, than for specific species, like humans. This may suggest good prospects for improved future models that first screen viruses based on their likelihood of infecting mammals, and then, in a second stage, for likelihood of human infection.
The most challenging scenarios were for predictions of viruses that were not similar to viruses in the training data, and the question remains whether we can expect reasonable generalization of predictive models to completely new viruses given that, at the time of writing, viruses do not appear to share a common ancestor.
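For readers unfamiliar with the k-mer features discussed above: a genomic sequence is typically converted into a normalized vector of k-mer frequencies before being fed to a classifier. A minimal sketch (the sequence and k are arbitrary; this is not the paper's feature pipeline):

```python
from collections import Counter

def kmer_profile(seq, k=3):
    """Normalized k-mer frequency profile of one nucleotide sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

profile = kmer_profile("ATGATGATGC", k=3)   # 8 overlapping 3-mers
```

Each sequence thus becomes a point in a fixed-dimensional frequency space, regardless of genome length, which is what makes k-mer profiles convenient inputs for the models compared in the paper.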
Schuiveling, M.; Liu, H.; Eek, D.; Hanusov, M.; van Duin, I.; ter Maat, L. S.; van der Weerd, J. C.; van den Berkmortel, F. W. P. J.; Blank, C. U.; Breimer, G. E.; Burgers, F. H.; Boers-Sonderen, M.; van den Eertwegh, A. J. M.; de Groot, J. W.; Haanen, J. B. A. G.; Hospers, G. A. P.; Kapiteijn, E.; Piersma, D.; Simkens, L. H. J.; Westgeest, H. M.; Schrader, A. M. R.; van Diest, P. J.; Lv, J.; Zhu, Y.; Tenorio, C. G. C.; Chohan, B. S.; Eastwood, M.; Raza, S. E. A.; Torbati, N.; Meshcheryakova, A.; Mechtcheriakova, D.; Mahbod, A.; Adams, D.; Galdran, A.; Pluim, J. P. W.; Blokx, W. A. M.; Suijker
Patients with advanced melanoma are treated with immune checkpoint inhibitors (ICIs), yet less than 50% of patients achieve a durable response while all patients are exposed to the risk of severe side effects. Tumor-infiltrating lymphocytes (TILs) in pathology images are associated with ICI outcomes, but manual assessment is subjective. In addition, the predictive value of other immune cell subsets, including plasma cells, neutrophils, histiocytes, and melanophages, remains unclear. We organized the Panoptic segmentation of nUclei and tissue in advanced MelanomA (PUMA) challenge to evaluate whether the spatial localization of TILs and other immune cell subsets on melanoma H&E slides collected before the start of treatment was associated with treatment outcomes. Algorithm performance was evaluated on a hidden test set, after which top-ranked algorithms were applied to pre-treatment metastatic whole-slide images from a large, multicenter cohort of patients with advanced melanoma treated with first-line ICIs (n=1102). Automatically quantified tissue features and immune cell subsets were then associated with clinical outcomes. Top-performing algorithms improved detection of immune cell subsets, although accuracy for rare classes remained limited. Across challenge participants, TIL density showed the most consistent association with treatment response and survival. Associations for stromal TILs were weaker, while plasma cells, histiocytes, melanophages, neutrophils, necrosis, and blood vessels did not show independent associations with outcomes. Overall, the results from the PUMA challenge improved the state of the art of immune cell detection in melanoma histopathology and show that intra-tumoral lymphocytes are the immune cell subset most consistently associated with treatment response and survival.
Highlights
- We organized the first melanoma-specific tissue and nuclei segmentation competition
- Winning algorithms were applied to 1102 whole-slide images for biomarker analysis
- Intra-tumoral TILs were associated with response to immune checkpoint inhibitors
- Other immune cell subsets showed no independent association with treatment outcomes
- Tissue segmentation on WSIs was limited by low heterogeneity in training data
Graphical abstract: Figure 1.
Kilim, O.; Martinez Ruiz, C.; Pipek, O.; Sztupinszki, Z.; Huebner, A.; Diossy, M.; Prosz, A.; Moore, D.; Jamal-Hanjani, M.; Hackshaw, A.; Fillinger, J.; Moldvay, J.; Csabai, I.; Swanton, C.; Szallasi, Z.
The standard treatment for stage I lung adenocarcinoma is surgical resection, in most cases without additional systemic adjuvant treatment. A significant proportion of stage I cases recur, with a less than 50% 5-year survival rate. Clinical data suggest that adjuvant treatment may improve survival in such recurrent cases. However, previously evaluated predictors, such as the IASLC grading system from histological sections and transcriptomic profiles, have not been sufficiently accurate and consistent for risk stratification and for guiding therapeutic interventions. We hypothesized that these previously investigated diverse diagnostic measurements carry complementary information that may provide higher prognostic power when combined. Here we describe a multimodal deep learning method, PATH-ORACLE. This biomarker is built on top of the prospectively validated transcriptomic-based ORACLE score with the addition of routine histological sections processed by pre-trained foundation models. PATH-ORACLE predicts recurrence with an accuracy of over 85% in two independent cohorts. Given further validation, this predictor could be used to prioritize stage IB patients for adjuvant chemotherapy in a more consistent fashion. Furthermore, for stage IA cases, PATH-ORACLE combined with liquid biopsy-based monitoring may help identify high-risk patients suitable for adjuvant targeted therapy.
Highlights
- Multimodal AI model (PATH-ORACLE) integrates histology and transcriptomics to predict stage I LUAD recurrence
- PATH-ORACLE outperforms IASLC grading and transcriptomic or image-based models alone
- Model achieves >85% recurrence prediction accuracy across independent international cohorts
- PATH-ORACLE refines risk stratification within both stage IA and IB lung adenocarcinoma
- Biomarker may guide adjuvant therapy selection and surveillance in early-stage disease
Guler, F.; Goksuluk, D.; Xu, M.; Choudhary, G.; agraz, m.
Applying deep learning models to RNA-Seq data poses substantial challenges, primarily due to the high dimensionality of the data and the limited sample sizes. To address these issues, this study introduces an advanced deep learning pipeline that integrates feature engineering with data augmentation. The engineering application focuses on biomedical engineering, specifically the classification of RNA-Seq datasets for disease diagnosis. The proposed framework was initially validated on synthetic datasets generated from Naive Bayes models, where MLP-based augmentation yielded a notable improvement in predictive performance. Building on this foundation, we applied the approach to chromophobe renal cell carcinoma (KICH) RNA-Seq data from The Cancer Genome Atlas (TCGA). Following standard preprocessing steps (normalization, transformation, and dimensionality reduction), the analysis concentrated on three main aspects: augmentation strategies, preprocessing methods, and explainable AI (XAI) techniques in relation to classification outcomes. Feature selection was performed through PCA, Boruta, and RF-based methods. Three augmentation strategies (linear interpolation, SMOTE, and MixUp) were evaluated. To maintain methodological rigor, augmentation was applied exclusively to the training set, while the test set was held out for unbiased evaluation. Within this framework, we conducted a comparative assessment of multiple deep learning architectures, including MLP, GNN, and the recently proposed Kolmogorov-Arnold networks (KAN). The GNN achieved the highest classification accuracy (99.47%) when trained with MixUp augmentation combined with RF feature selection, along with the best F1 score (0.9948). Consequently, the GNN-based XAI framework was applied to the RF dataset enriched with MixUp.
XAI analyses identified the top 20 most influential genes, such as HNF4A, DACH2, MAPK15, and NAT2, which played the greatest role in classification, thereby confirming the biological plausibility of the model outputs. To further validate model robustness, cervical cancer and Alzheimer's RNA-Seq datasets were also tested, yielding consistent and reliable results. Overall, the findings highlight the value of incorporating data augmentation into deep learning models for RNA-Seq analysis, not only to improve predictive performance but also to enhance biological interpretability through explainable AI approaches.
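Of the augmentation strategies compared above, MixUp is the simplest to state: each synthetic sample is a convex combination of two real samples and their labels, with the mixing weight drawn from a Beta distribution. A self-contained sketch (the feature vectors and alpha value are illustrative, not the study's settings):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with a Beta-drawn weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

rng = random.Random(0)                     # fixed seed for reproducibility
x, y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

As the abstract stresses, such augmentation belongs on the training split only; mixing anything involving test samples would leak information into evaluation.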
Fletcher, W. L.; Sinha, S.
The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance on diverse right-censored time-to-event data (aka survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, to compare primarily their variable selection capability and secondarily their survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.
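One of the evaluation metrics above, the concordance index, rewards models whose risk ordering matches the observed event ordering. A bare-bones version for right-censored data (toy inputs; production implementations also handle ties in event times):

```python
def concordance_index(times, events, risk):
    """C-index: fraction of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] == 1) strictly before time j; higher risk should fail earlier.
    """
    concordant = comparable = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5      # ties in risk count half
    return concordant / comparable

# Risk ordering perfectly matches event ordering -> C-index of 1.0.
cindex = concordance_index(times=[2, 4, 6, 8], events=[1, 1, 0, 1],
                           risk=[0.9, 0.7, 0.3, 0.1])
```

A value of 0.5 corresponds to random risk ordering, which is why it serves as the baseline in benchmarks like this one.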
Bolut, C.; Pacary, A.; Pieruccioni, L.; Ousset, M.; Paupert, J.; Casteilla, L.; Simoncini, D.
Machine learning (ML) models are effective at classifying images across various fields, including biology. However, their performance on biomedical images is often limited by the small size of available datasets, which are constrained by the time-consuming and costly nature of experimental data collection. A review of the literature shows that many studies using biomedical images fail to follow ML best practices. This study focuses on regenerative medicine, which aims to promote tissue regeneration rather than scarring. To explore this process, we applied ML to a limited dataset of images of mouse tissues, aiming to distinguish between regenerating and scarring samples. As expected, binary classification failed to generalize to independent data. A novel SHAP-based analysis revealed that the overfitting models were based on spurious correlations, including individual mouse characteristics that aligned with the regeneration/scarring labels. The models appeared to be solving the binary classification task but were in fact recognizing individuals. To investigate this behavior further, we examined the test set confusion matrix of a model trained to identify individual mice. We observed that, beyond individual recognition, individuals were grouped according to the time elapsed after injury (day 3 or 10) and the healing outcome (regeneration or scarring). We hypothesized that these groupings were based on relevant biological information captured by the model. To test this hypothesis, we successfully trained a model to classify images according to the time elapsed after injury (3 or 10 days), demonstrating that ML can extract relevant biological information when the task is aligned with what the data can actually support. Altogether, this study demonstrates that carefully examining a model's explanations is an effective way not only to unveil putative biases but also to extract relevant information from a limited dataset.
Author Summary: Machine learning is increasingly used to analyze biomedical images, but in many experimental settings only small datasets are available, which can easily mislead powerful models. In this study, we looked at images from mouse tissues, with the goal of distinguishing healing by regeneration from healing by scarring. Although standard machine learning models appeared to perform well during training, they failed to generalize to new animals. By carefully analyzing model explanations, we found that the models were not learning biologically meaningful patterns of tissue repair but instead were recognizing individual mice based on subtle image-specific signatures. Importantly, this same analysis revealed that the models did capture relevant biological information when the task was better aligned with the data, such as distinguishing early versus late stages of healing. Our results highlight how explanation methods can uncover hidden biases, prevent false conclusions, and help researchers extract meaningful biological insights even from limited and imperfect datasets.
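The leakage failure described above (models recognizing individual mice rather than biology) is commonly guarded against by splitting data at the level of the individual, so no animal contributes images to both training and test sets. A stdlib-only sketch of such a group-aware split; the sample/group layout is invented for illustration:

```python
import random

def group_split(samples, groups, test_frac=0.25, seed=0):
    """Hold out whole groups (e.g. individual animals), not single images."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test = max(1, int(len(unique_groups) * test_frac))
    test_groups = set(unique_groups[:n_test])
    train = [s for s, g in zip(samples, groups) if g not in test_groups]
    test = [s for s, g in zip(samples, groups) if g in test_groups]
    return train, test, test_groups

samples = list(range(12))                 # 12 images
groups = [i // 3 for i in range(12)]      # 4 mice, 3 images each
train, test, held_out = group_split(samples, groups)
```

Splitting by animal rather than by image is exactly what makes individual-recognition shortcuts visible as a generalization failure instead of an inflated accuracy.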
Richardson, E.; Aarts, Y. J. M.; Altin, J. A.; Baakman, C. A. B.; Bradley, P.; Chen, B.; Clifford, J.; Dhar, M.; Diepenbroek, D.; Fast, E.; Gowthaman, R.; He, J.; Karnaukhov, V.; Marzella, D. F.; Meysman, P.; Nielsen, M.; Nilsson, J. B.; Deleuran, S. N.; Parizi, F. M.; Pelissier, A.; Pierce, B. G.; Rodriguez Martinez, M.; Roran A R, D.; Saravanakumar, S.; Shao, Y.; Smit, N.; Van Houcke, M.; Visani, G. M.; Wan, Y.-T. R.; Wang, X.; Woods, L.; Wuyts, S.; Xiao, C.; Xue, L. C.; IMMREP25 Participant Consortium, ; Barton, J.; Noakes, M.; May, D. H.; Peters, B.
T cell receptors (TCRs) can bind to peptides presented by MHC molecules (pMHC) as a first step to trigger a T cell response. Reliable approaches to predict TCR:pMHC binding would have broad applications in clinical diagnostics, therapeutics, and the fundamental understanding of molecular interactions. IMMREP is a community-organized series of prediction contests that asks participants to predict TCR:pMHC binding on unpublished datasets. Previous iterations in 2022 and 2023 showed that multiple approaches can predict TCR:pMHC binding with significant accuracy (median AUC_0.1 ≥ 0.7) for peptides where experimental data are available ("seen" peptides). In contrast, models did not outperform random guessing for peptides that have no such data available ("unseen" peptides). Here we report on the results of IMMREP25, which focused solely on unseen peptides in order to evaluate the cutting edge of the field. We received 126 named submissions predicting the specificity of 1,000 TCRs against twenty unseen peptides restricted by one of two MHC molecules (HLA-A*02:01 and HLA-B*40:01). The best performing methods showed a macro-AUC_0.1 of 0.60, significantly better than random, demonstrating significant advances in the field. The top performing methods incorporated structural modeling into their approach, indicating that, especially for unseen peptides, a structural understanding aids in the prediction of TCR:pMHC interactions. The results from this benchmark highlight the significant challenges remaining for TCR:pMHC predictions and will inform future method development.
Gainullin, V. G.; Gray, M.; Kumar, M.; Luebker, S.; Lehman, A. M.; Choudhry, O. A.; Roberta, J.; Flake, D. D.; Shanmugam, A.; Cortes, K.; Chang, E.; Uren, P. J.; Mazloom, A.; Garces, J.; Silvestri, G. A.; Chesla, D. W.; Given, R. W.; Beer, T. M.; Diehl, F.
Multi-cancer early detection (MCED) tests can detect several cancer types and stages. We previously developed a methylation and protein (MP V1) MCED classifier. In this study, we present a refined MP V2 classifier, developed by evaluating model architectures that improved performance in prospectively enrolled case-control cohorts under standard testing conditions. The newly developed MP V2 classifier was trained to be more generalizable and achieve increased early-stage sensitivity at a target specificity of ≥97.0%. MP V1 and MP V2 classifier performances were compared using a previously described test set, and MP V2 performance was also evaluated in a new independent clinical validation set. Compared to MP V1, the MP V2 classifier demonstrated a 7.3% increase in overall sensitivity, with sensitivity increases of 7.6%, 9.2%, and 8.3% for stages I, II, and stages I/II, respectively, in the intended use (breast and prostate cancers excluded) test set. In an independent validation intended use set, the MP V2 classifier showed an overall sensitivity of 55.6%, with sensitivities of 26.8%, 42.9%, and 34.8% for stages I, II, and stages I/II, respectively. In a case-control setting, the MP V2 classifier offered improved sensitivity for early-stage cancers at a lower specificity target.
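The operating point quoted above (sensitivity at a target specificity of ≥97.0%) can be computed by thresholding classifier scores at the appropriate quantile of the non-cancer class. A toy sketch with invented score lists, not the classifier's actual outputs:

```python
import math

def sensitivity_at_specificity(scores_pos, scores_neg, target_spec=0.97):
    """Threshold at the target-specificity quantile of the negatives,
    then report sensitivity and achieved specificity at that cutoff."""
    neg_sorted = sorted(scores_neg)
    k = math.ceil(target_spec * len(neg_sorted)) - 1
    threshold = neg_sorted[k]
    sens = sum(s > threshold for s in scores_pos) / len(scores_pos)
    spec = sum(s <= threshold for s in scores_neg) / len(scores_neg)
    return sens, spec

# 100 invented control scores spread over [0, 1); 5 invented cancer scores.
neg = [i / 100 for i in range(100)]
pos = [0.5, 0.95, 0.97, 0.99, 1.0]
sens, spec = sensitivity_at_specificity(pos, neg)
```

Fixing specificity first and reading off sensitivity is the standard way MCED papers report early-stage performance, since false positives are the binding constraint in screening.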
Reinosa, R.
Introduction: The precise determination of diagnostic cut-off points is essential for the development of multimarker panels in oncology. In previous work on pulmonary nodules, it was observed that the standard two-parameter logistic fit could be insufficient for biomarkers with asymmetric distributions. Furthermore, the calculation of empirical cut-off points based on graphical visualization presented limitations in precision and reproducibility. Objective: This study presents a methodological advancement in the data analysis phase (Stage 1), introducing new Python algorithms for the direct analytical calculation of empirical intersections and robust mathematical modeling using Dual Annealing with both two-parameter and four-parameter logistic functions. This improved methodology feeds into the ThresholdXpert 1.0 software tool for combinatorial optimization of biomarker panels (Stage 2), and is applied here to the diagnostic challenge of hepatocellular carcinoma (HCC). Methods: The methodology was first validated by re-analyzing a dataset of patients with pulmonary nodules (N=895). It was subsequently applied to an HCC dataset derived from the cohort of Jang et al. (208 HCC, 193 cirrhosis, 401 total), randomly divided into a training set (280) and an independent test set (121). Scripts were developed to compare the previous two-parameter logistic fit with the new two- and four-parameter logistic models. Finally, ThresholdXpert 1.0 was used for multimarker panel optimization. Results: The integration of empirical calculation, logistic modeling, and combinatorial optimization through ThresholdXpert 1.0 provides a robust and coherent framework for the development of multimarker diagnostic panels. The four-parameter logistic model provided additional validation without substantially modifying cut-off values for most biomarkers, confirming the stability of the approach while offering greater flexibility for complex distributions.
When applied to hepatocellular carcinoma, the framework identified a molecular panel composed of AFP, PIVKA-II, OPN, and DKK-1 with sensitivity of 0.77 and specificity of 0.72, and an optimized panel incorporating inverse MELD that achieved the best overall balance (sensitivity 0.73, specificity 0.75) in independent external validation. These results demonstrate the potential of this approach as a generalizable tool for the optimized design of binary diagnostic systems in oncology. Conclusion: The integration of complementary mathematical modeling enhances the capability of ThresholdXpert 1.0 to identify robust diagnostic panels, as in some cases a single biomarker may outperform biomarker combinations, and vice versa. This approach enabled the integration of molecular biomarkers and clinical variables under a unified mathematical framework. Contact: roberto117343@gmail.com
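As a small illustration of the four-parameter logistic (4PL) model discussed above: it extends the two-parameter fit with free lower and upper asymptotes, which is what gives it flexibility for asymmetric biomarker distributions. The parameter values below are arbitrary; this is the generic 4PL form, not the study's fitted curves:

```python
import math

def logistic4(x, bottom, top, x50, slope):
    """Four-parameter logistic curve; collapses to the two-parameter
    form when bottom = 0 and top = 1."""
    return bottom + (top - bottom) / (1.0 + math.exp(-slope * (x - x50)))

mid = logistic4(2.0, bottom=0.1, top=0.9, x50=2.0, slope=3.0)    # midpoint
low = logistic4(-100.0, bottom=0.1, top=0.9, x50=2.0, slope=3.0)
high = logistic4(100.0, bottom=0.1, top=0.9, x50=2.0, slope=3.0)
```

In a full pipeline, these four parameters would be fitted per biomarker (the study uses Dual Annealing for this) before empirical intersections and cut-offs are computed.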
Cannon, M. V.; Gust, M. J.; Gross, A. C.; Cam, M.; Reinecke, J. B.; Jimenez Garcia, L.; Strawser, C. H.; Ryan, L.; Sammons, M.; Zhang, C.-Z.; Roberts, R. D.
Motivation: Single cell RNAseq (scRNAseq) is an ideal tool to characterize the heterogeneity within the tumor microenvironment; however, accurate identification of tumor cells can be a challenge. Reference-based methods can be inaccurate, if reference datasets are even available, and current purpose-built methods can also be inaccurate, particularly with highly heterogeneous tumor types. Improved methods are needed. We explored the use of genetic variants to distinguish tumor from normal cells within scRNAseq data. Results: We characterized the limitations inherent in calling variants from scRNAseq data, quantifying how data sparsity precludes genetic distance calculation between single cells. As a novel workaround, we pooled data from transcriptionally similar cell clusters to call high-quality variants, then calculated pairwise differences between cell populations and performed hierarchical clustering. We quantified confidence in genetic divergence between tumor and normal cell populations using bootstrapping. We performed extensive validation to assess accurate identification of tumor cells using ground-truth datasets. Application of our method to human scRNAseq samples highlighted the utility of our approach and revealed how mutational burden influences successful tumor cell identification. Improved cell type assignment in scRNAseq data will facilitate analysis of tumor samples and, in turn, accelerate our understanding of the mechanisms underlying tumor progression and reveal potential biological vulnerabilities that can be exploited to develop improved treatment options. Availability and implementation: Our method is publicly available as an R package, SCANBIT (Single Cell Altered Nucleotide Based Inference of Tumor): https://github.com/kidcancerlab/scanBit.
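The clustering step above rests on a pairwise genetic distance between pooled (pseudobulk) cell populations. A deliberately simplified version, comparing cluster-level variant calls encoded as strings; the genotype encoding and cluster names are invented, and the actual SCANBIT workflow operates on real variant calls with bootstrapped confidence estimates:

```python
def variant_distance(calls_a, calls_b):
    """Fraction of co-covered variant sites where two pseudobulk
    profiles disagree; sites with a missing call ('.') are skipped."""
    shared = [(a, b) for a, b in zip(calls_a, calls_b) if '.' not in (a, b)]
    if not shared:
        return None                       # no co-covered sites
    return sum(a != b for a, b in shared) / len(shared)

# Toy pseudobulk genotypes for three cell clusters over 8 sites.
clusters = {
    "tumor_1": "AAGGTTCC",
    "tumor_2": "AAGGTTCA",
    "normal":  "AACCTTGG",
}
d_tumor = variant_distance(clusters["tumor_1"], clusters["tumor_2"])
d_cross = variant_distance(clusters["tumor_1"], clusters["normal"])
```

Tumor subclones end up genetically closer to each other than to normal cells, which is the separation that hierarchical clustering on such a distance matrix exploits.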
Nguyen, D. H.
Numerous studies have shown that the morphological phenotype of a cell or organoid correlates with its susceptibility to anti-cancer agents. However, traditional methods of measuring phenotype rely on spatial metrics such as area, volume, perimeter, and signal intensity, which work but are limited. These approaches cannot measure many crucial features of spatial context, such as chirality, which is the property of having left- and right-handedness. Volume cannot register chirality because a left shoe and a right shoe harbor the same amount of volume. Though spatial context in the form of chirality, the direction of gravity, and the axis of polarity are intuitive notions to humans, the traditional metrics relied on by cell biologists, pathologists, radiologists, and machine learning scientists up to this point cannot register these fundamental notions. The Linearized Compressed Polar Coordinates (LCPC) Transform is a novel algorithm that can capture spatial context unlike any other metric. The LCPC Transform translates a two-dimensional (2D) contour into a discrete sinusoid wave by overlaying a grid system that tracks points of intersection between the contour and the grid lines. It turns the contour into a series of sequential pairs of discrete coordinates, with the independent coordinate (x-coordinate) being consecutive positions in 2D space. Each dependent coordinate (y-coordinate) is the distance from an intersection of the contour with a gridline to the origin of the grid system. With the contour in the form of a discrete sinusoid wave, the Fast Fourier Transform is then applied to the data. In this way, the shape of cells in 2D and 3D cell culture is represented systematically and multidimensionally, allowing for robust quantitative stratification that will reveal insights into treatment resistance.
Summary: This article explains how novel features of morphology in cells and organoids can be measured by the Linearized Compressed Polar Coordinates (LCPC) Transform, a spatial algorithm that measures what traditional metrics, such as area, volume, and surface area, cannot. Best practices for shape orientation and alignment are discussed.
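A flavor of the general idea (contour → ordered distance signal → Fourier spectrum) can be given in a few lines. This radial signature is a simplification for illustration only; the actual LCPC Transform uses grid-line intersections rather than polar sampling, and a naive DFT stands in for the FFT:

```python
import cmath
import math

def radial_signature(contour, center):
    """Distance from each ordered contour point to a fixed center."""
    cx, cy = center
    return [math.hypot(x - cx, y - cy) for x, y in contour]

def dft_magnitudes(signal):
    """Normalized magnitude spectrum of a discrete signal (naive DFT)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n for k in range(n)]

# A circle's radial signature is constant: all energy in the DC term.
circle = [(math.cos(2 * math.pi * t / 16), math.sin(2 * math.pi * t / 16))
          for t in range(16)]
mags = dft_magnitudes(radial_signature(circle, (0.0, 0.0)))
```

Shape irregularities show up as energy at higher harmonics, which is the kind of multidimensional shape descriptor the article advocates over scalar metrics like area or volume.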
Ouso, D.; Pollastri, G.
Deep learning (DL) has advanced computational genome annotation tasks such as protein sub-cellular localisation (SCL) prediction. Nonetheless, its potential remains underutilised, primarily because of the limited availability of high-quality reference data and suboptimal input preparation strategies. In this study, we develop and analyse a high-quality dataset derived from the latest release of the Universal Protein Knowledgebase (UniProtKB), designed to address existing challenges and support robust DL-based SCL modelling. The dataset was constructed through extensive quality preprocessing to ensure reliability, manual label mapping to enhance the quantity and diversity of the training data, and stringent partitioning to minimise data leakage. We validated the dataset using independent test sets, achieving up to 10.8% performance improvement, measured by the area under the precision-recall curve (PR-AUC), compared to the state-of-the-art (SoTA). Furthermore, we highlighted potential performance metric inflation in existing SoTA predictors by demonstrating, for the first time, at least 4.8% training-to-testing data leakage (pre-sequence representation) when using only 10% of the training set under homology augmentation (augmentation based on sequence similarity database searches; details in Sub-section 2.1), a commonly used data augmentation strategy in DL-based SCL prediction modelling. SCL2205 will efficiently support the development of robust, trustworthy, and generalisable DL-based SCL predictors, while minimising data leakage and promoting reproducibility. It is openly available under the Creative Commons Zero (CC0 1.0) licence on DRYAD and is conveniently deployed as a package on the Python Package Index as p-scldata.
Vliora, A.; Tiberti, M.; Papaleo, E.
MAVISp (Multi-layered Assessment of VarIants by Structure for proteins) is a structure-based framework for facilitating mechanistic interpretation of missense variants, with protein stability as one of its core analytical layers. When software tools are updated, a key consideration for database curation is whether the new version can be adopted without compromising compatibility with existing entries. This study evaluated the effect of replacing FoldX5 with FoldX5.1 on the results of the MAVISp stability workflow. We compared predicted changes in folding free energy for 539,809 shared variants across 119 proteins. We found high overall agreement with a mean Pearson correlation of 0.933 and a mean Cohen coefficient of 0.814. Most proteins showed strong concordance, whereas only three (NUPR1, TSC1, and TMEM127) showed poor agreement. The number of disagreements was higher at sites with low AlphaFold2 confidence for NUPR1 and TSC1. These outliers did not display systematic inter-version bias, as mean shifts in folding free energies between versions were minimal. Collectively, these findings support adopting FoldX5.1 for future MAVISp data collection. We will include a transition period, during which existing entries retain FoldX5 annotations until their scheduled annual update, while new or updated entries are processed with FoldX5.1. To facilitate this transition, the FoldX software version has been added as a new metadata annotation in the MAVISp database.
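The version-agreement check above boils down to correlating per-variant folding free energy predictions between FoldX releases. A self-contained Pearson correlation sketch; the paired values below are invented stand-ins, not MAVISp data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

v5 = [0.1, 1.2, -0.4, 2.5, 0.8]     # hypothetical version-A predictions
v51 = [0.2, 1.1, -0.3, 2.4, 0.9]    # hypothetical version-B predictions
r = pearson(v5, v51)
```

In the study this continuous agreement is complemented by a categorical one (Cohen's coefficient on stability classes), since a database cares about both the raw values and the resulting classifications.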
Zia, M. K.; Plessinger, B.; Eng, K. H.; Flierl, A.; Wilbert, M.; Jans, K.; Whalen, P.; Mullin, S.; Ohm, J.; Singh, A. K.; Farrugia, M.; Morrison, C.; Darlak, C. J.; Seshadri, M.
Show abstract
The lack of interoperability among clinical and research data systems poses a significant barrier to cancer researchers interested in evaluating novel mechanistic hypotheses or translating innovative treatment strategies from the laboratory to the clinic. To address this gap in knowledge, we developed an innovative, web-based, data discovery, visualization and analysis tool (nSight) that allows researchers to quickly and easily query clinical/research data and construct de-identified cancer cohorts. Guiding principles for development of the tool were focused on ease of use, intuitiveness, self-service, and presentation of structured but de-identified data to the end user. nSight provides users with information on patient demographics, disease histology, diagnostic procedures and therapeutic interventions, timeline of disease progression/recurrence, along with available molecular profiling/sequencing data and indicators of participation in epidemiologic or lifestyle studies for specific cancer patient cohorts. The platform also allows users to obtain summary statistics based on demographic, histologic and clinical factors as well as perform basic survival analysis using Kaplan-Meier curves between specific patient cohorts. nSight is an intuitive, user-friendly tool that enables visualization, integration and analysis of multimodal clinical and research data without placing high technical demands or time constraints on researchers. The platform is designed for research feasibility assessment, cohort development, and retrospective data discovery, which in turn should help investigators identify potential study populations and explore novel hypotheses.
Fry Brumit, D.; Sorgen, A. A.; Fodor, A.
Show abstract
Background: Beta diversity quantifies pairwise differences between two or more communities through matrix transformations, which are either naive to phylogeny or phylogenetically aware. Methods have recently been introduced that also consider compositionality and sparsity and that display an increased magnitude of pseudo-F scores as produced by PERMANOVA to measure effect size. In this study, we ask how transformations that consider phylogeny, sparsity, and compositionality compare to older, simpler methods across five publicly available datasets. Results: Application of random forest methods to 107 features across 5 datasets did not yield a consistent increase in classification performance between different beta diversity methods. Limiting datasets to just three eigenvalue decomposition (EVD) axes leads to a small but reliably detectable decrease in performance compared to giving random forest models access to log-normalized or even un-normalized raw count tables. Increasing the number of included EVD axes in classification improves performance across all available models up to ~10-20 axes. We observed larger variation in PERMANOVA pseudo-F scores for some features associated with phylogenetically and compositionally aware beta diversity algorithms across multiple datasets, but did not find that these improved scores yielded consistently increased resolution or accuracy for machine learning methods. Conclusions: While EVD remains an essential technique for dimension reduction, retaining higher-dimensional structures past 3 EVD axes may improve performance. Elevated but insignificant pseudo-F scores may be explained by the higher variance in pseudo-F scores for phylogenetically or compositionally aware methods compared to simpler methods. This indicates that pseudo-F scores are an unreliable overall metric of algorithm performance.
Taken together, our results show that choice of beta diversity metric does not yield a substantial difference in effect size or machine learning performance. We conclude that analysts are free to choose appropriate methods for each dataset, balancing simplicity against corrections for phylogeny, sparsity, and compositionality, and that these choices are unlikely to impact the overall power and resolution of biological conclusions drawn from microbial data.
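As an example of the "older, simpler" phylogeny-naive transformations this study compares against, Bray-Curtis dissimilarity can be computed directly from a count table; the three sample vectors below are invented for illustration:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors:
    1 - 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b))."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))

def dissimilarity_matrix(samples):
    """Pairwise Bray-Curtis matrix, the kind of transformation fed to
    PERMANOVA or to eigenvalue decomposition (PCoA) before classification."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]

counts = [
    [10, 0, 5],   # sample A
    [8, 2, 5],    # sample B
    [0, 20, 0],   # sample C
]
m = dissimilarity_matrix(counts)
print(round(m[0][1], 3), round(m[0][2], 3))  # A~B similar, A~C disjoint
```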
Brate, J.; Grande, E. G.; Pedersen, B. N.; Frengen, T. G.; Stene-Johansen, K.
Show abstract
Here we evaluated the performance of a previously published tiling PCR primer scheme by Ringlander et al. (2022) for whole-genome amplification of Hepatitis B virus (HBV) in combination with Oxford Nanopore sequencing. The primer set originally developed for Ion Torrent sequencing was adapted by removing platform-specific adapters and tested using clinical serum or plasma samples submitted for routine HBV genotyping and resistance testing. Two multiplexing strategies were compared: a single PCR pool containing all primers and a two-pool strategy with non-overlapping amplicons. Sequencing reads were processed using a Nanopore analysis pipeline, and genome coverage and amplicon performance were compared across samples spanning a wide Ct range and representing HBV genotypes A-E. Across all samples, the median genome coverage was approximately 50%, although recovery varied widely, ranging from complete failure to nearly full genomes. Combining all primers into a single PCR reaction, or separating overlapping amplicons into different reactions, had little overall impact on genome recovery, and no consistent differences between the two pooling strategies were observed. In contrast, amplification efficiency differed markedly between individual amplicons. Amplicons 1-5 generally produced higher sequencing depth, whereas amplicons 6-10 frequently showed low coverage and contributed to incomplete genome recovery. Genome coverage was strongly associated with Ct values, with higher coverage observed in samples with lower Ct values, while coverage was broadly similar across genotypes. These results demonstrate that the Ringlander et al. primer scheme can be adapted for multiplex PCR and Nanopore sequencing of HBV, but uneven amplicon performance limits consistent full-genome recovery and highlights the need for further optimization of HBV tiling PCR designs.
Zhang, X.
Show abstract
Large language model agents are increasingly used for bioinformatics tasks that require external databases, tool use, and long multi-step retrieval workflows. However, practical evaluation of these systems remains limited, especially for prompts whose target set is both large and biologically heterogeneous. Here, I benchmarked three agent systems on the same difficult retrieval task: downloading coccolithophore calcification-related proteins from UniProt across six mechanistically distinct categories, while producing category-separated FASTA files and supporting evidence. The compared systems were Codex app agents extended with Claude Scientific Skills, Biomni Lab online, and DeerFlow 2 with default skills only. Outputs were normalized at the UniProt accession level and compared category by category using overlap analysis, Venn decomposition, and a heuristic relevance assessment of each subset relative to the benchmark prompt. Across the six shared categories, Codex retrieved 2,118 proteins, DeerFlow 6,255, and Biomni 8,752 in a run. Codex showed the best balance between sensitivity and specificity: 92.4% of its proteins fell into subsets labeled high relevance and the remaining 7.6% into medium relevance. DeerFlow was substantially more exhaustive, but 43.8% of its proteins fell into low or low-medium relevance subsets. Biomni produced the largest sets, yet 69.5% of its proteins fell into low or low-medium relevance subsets, mainly due to broad expansion into generic calcium sensors, kinases, transcription factors, and poorly specific domain families. Category-specific analysis showed that Codex was the strongest primary source for inorganic carbon transport, calcium and pH regulation, vesicle trafficking, and signaling, whereas DeerFlow contributed valuable complementary matrix and polysaccharide candidates. 
A second run for each system also separated them strongly by repeatability: Codex had the highest within-system stability (mean category Jaccard 0.982; micro-Jaccard 0.974), DeerFlow was intermediate (0.795; 0.571), and Biomni was least stable (0.412; 0.319). These results suggest that for complex protein-family retrieval tasks, agent quality depends less on raw output volume than on prompt decomposition, taxonomic scoping, exact query generation, provenance-rich export artifacts, and repeated-run stability.
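The within-system stability reported above is a Jaccard index between the accession sets returned by two runs. A minimal sketch with hypothetical accessions (not the study's actual UniProt outputs):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| between two sets of accessions."""
    if not a and not b:
        return 1.0  # two empty retrievals are trivially identical
    return len(a & b) / len(a | b)

run1 = {"P12345", "Q67890", "A0A111", "B2B222"}
run2 = {"P12345", "Q67890", "A0A111", "C3C333"}
print(jaccard(run1, run2))  # 3 shared / 5 total = 0.6
```

A mean of per-category Jaccard values gives the "mean category Jaccard" figure, while pooling all accessions before comparison gives the micro-Jaccard.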
Xu, S.; Wang, Z.; Wang, H.; Ding, Z.; Zou, Y.; Cao, Y.
Show abstract
Online cancer peer-support communities generate large volumes of patient-authored and caregiver-authored text that may reflect distress, coping, and informational needs. Automated emotional tone classification could support scalable monitoring, but supervised modeling depends on label quality and may benefit from explicit context features. Using the Mental Health Insights: Vulnerable Cancer Survivors & Caregivers dataset, we compared five model families (TF-IDF Logistic Regression, Random Forest, LightGBM, GRU, and fine-tuned ALBERT) on a three-class target (Negative/Neutral/Positive) derived from four original categories. We introduced two extensions: (i) LLM-based annotation to generate parallel "AI labels" and (ii) token-based augmentation that prepends LLM-extracted structured variables (reporter role and cancer type) to the post text. Models were trained with a 60/20/20 stratified train/validation/test split, with hyperparameters selected on validation data only. Test performance was summarized using weighted F1 and macro one-vs-rest AUC with bootstrap confidence intervals, with paired comparisons based on McNemar tests and false discovery rate adjustment. The LLM annotator produced substantial redistribution in the four-class label space, shifting prevalence toward very negative relative to the original labels; the shift persisted but attenuated after collapsing to three classes. Across all model families, token augmentation improved held-out performance, with the largest gains for GRU and consistent improvements for ALBERT. Augmentation also reduced polarity-reversing errors (Negative ↔ Positive) for ALBERT, while adjacent errors (Negative ↔ Neutral) remained the dominant residual failure mode.
These results indicate that LLM-based supervision can introduce systematic measurement shifts that require auditing, yet LLM-extracted context incorporated via simple token augmentation provides a pragmatic, model-agnostic mechanism to improve downstream emotional tone classification for supportive oncology decision support. Author summary: We studied how to better monitor emotional tone in posts from online cancer peer-support communities, where patients and caregivers share experiences that may signal distress, coping, or unmet needs. Automated classification could help organizations and moderators identify when additional support may be needed, but these systems depend on the quality of the labels used for training and may miss clinical context. Using a public dataset of cancer survivor and caregiver posts, we trained and compared several machine-learning and deep-learning models to classify each post as negative, neutral, or positive. We tested two practical improvements. First, we used a large language model to generate an additional set of "AI labels" and examined how these differed from the original categories. Second, we extracted simple context information (whether the writer was a patient or caregiver and what cancer type was mentioned) and added this context to the text before model training. We found that adding context consistently improved performance across model types. However, the AI-generated labels shifted class distributions, indicating that automated labeling can introduce systematic changes that should be audited. Overall, simple context extraction can make emotional tone monitoring more accurate and useful for supportive oncology decision support.
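The token-based augmentation described in this abstract amounts to prepending extracted context variables as plain tokens ahead of the post text, so any text model can condition on them. A sketch with an invented tag format (the authors' exact token scheme is not specified):

```python
def augment_with_context(text: str, role: str = None, cancer_type: str = None) -> str:
    """Prepend structured context variables as plain tokens; works equally
    for TF-IDF pipelines and transformer tokenizers."""
    tokens = []
    if role:
        tokens.append(f"[ROLE={role.upper()}]")
    if cancer_type:
        tokens.append(f"[CANCER={cancer_type.upper()}]")
    return " ".join(tokens + [text])

post = "The scans came back and we are waiting to hear from the oncologist."
print(augment_with_context(post, role="caregiver", cancer_type="lung"))
```

Because augmentation happens at the string level, it is model-agnostic: the same augmented text can be fed to every model family being compared.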
Chanraeng, N.; Guo, J.; Srisongkram, T.; Hinwan, Y.; Fransson, P.; Sjödin, H.; Matsuura, Y.; Overgaard, H. J.; Panthong, W.; Ekalaksananan, T.; Pientong, C.; Phanthanawiboon, S.
Show abstract
Assessing the human infection potential of emerging coronaviruses remains a critical challenge for global health preparedness. In this study, we developed a machine learning-based framework to predict the human infection potential of coronaviruses and to identify associated sequence motifs using spike (S) protein sequences. A total of 3,904 complete S protein sequences were collected, annotated as human or non-human infection, and encoded using trimer-based k-mer features. Model benchmarking was conducted across 27 machine learning algorithms, followed by hyperparameter optimization of the selected model. Robustness and generalizability were evaluated using k-fold cross-validation and independent external validation. Feature interpretability was further assessed using SHAP analysis to identify sequence determinants associated with infection potential. The Random Forest classifier achieved the best performance, with accuracy, sensitivity, and specificity of 97.8%, 99%, and 97.4%, respectively, and demonstrated stable predictive performance across validation datasets. Notably, the KIQ and LEP motifs were strongly associated with human-infecting coronaviruses and mapped to the HR1 and N-terminal domain regions of the S protein. Overall, this framework provides a practical approach for risk assessment and surveillance of emerging coronaviruses. Author summary: Emerging coronaviruses continue to threaten global public health, but rapidly identifying viruses with the potential to infect humans remains challenging. Traditional experimental approaches are time-consuming and resource-intensive, limiting their use for large-scale surveillance. In this study, we developed a machine learning-based workflow to assess the human infection potential of coronaviruses using spike protein sequences. By analyzing sequence patterns across a diverse set of coronaviruses, our framework enables rapid screening of coronaviruses from multiple host species.
Unlike previous studies focused on limited coronavirus genera, our approach integrates all four genera and systematically evaluates multiple learning strategies. Importantly, our analysis identifies conserved sequence motifs linked to human infection potential, bridging predictive performance with biological interpretability. Our findings demonstrate that computational approaches can support early warning systems for identifying high-risk coronaviruses, helping to prioritize viruses for experimental validation, guide surveillance efforts, and strengthen global pandemic preparedness under a One Health perspective.
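The trimer-based k-mer encoding used in this framework can be sketched as counting overlapping 3-mers along a protein sequence; the spike fragment below is a toy string containing the KIQ motif, not a real S protein:

```python
from collections import Counter

def trimer_features(seq: str) -> Counter:
    """Count overlapping 3-mers (trimers) in a protein sequence; the counts
    form the feature vector fed to a classifier."""
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

spike_fragment = "KIQDSLSSTASALGKIQ"
feats = trimer_features(spike_fragment)
print(feats["KIQ"])  # the KIQ motif appears twice in this toy fragment
```

A sequence of length L yields L - 2 overlapping trimers, and motif-level importances (e.g. for KIQ or LEP) can then be read off a trained model with tools such as SHAP.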